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1. INTRODUCTION 

The database is a very integral part of any software application and relational database is the most 
popular type of database system available. Designing a database is a very tedious and critical part of any 
software engineering process. A large share of the total time is spent on designing the database. To design a 
database, professional database designers are required to avoid any problems in the future. So, in this ever- 
advancing world of technology, there is tremendous growth in the field of software engineering and hence 
there is great demand for efficient data storage and processing techniques. Hence under this big hood of 
software development, there is an implicit need of automating the entire process of designing a database 
which is a key support of any software project. 

A huge amount of data is generated and stored in a structured and unstructured format. Database 
schemas are a structured method of storing the data. Even though the data is structured, there exist multiple 
schemas and tables which are related, but the relations cannot be established by simple databases. One of the 
methods to represent schemas as a combined knowledge source is to design a knowledge graph (KG). The 
term “KG” was known to be used in writing since no less than 1973 in the paper presented by Edward W. 
Schneider. Google created its own knowledge graph [1], which was announced in 2012, is a more 
sophisticated embodiment of the knowledge graph ever seen in the technology world [2]. Following then, 
knowledge graphs have become more important in terms of research and application. As a consequence of 
the research that took place, quite a few KG products were developed. Cyc [3], Freebase [4], Google 
knowledge vault [5], DBpedia [6], YAGO [7], NELL [8], PROSPERA [9], Microsoft’s Probase [10] are 
among the prominent KGs ever created. Many real-world applications like Semantic Web search engine [11], 
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named entity recognition and disambiguation [12], [13], Extracting Information [14], [15], Recommendation 
systems [16], and Siri [17], Apple’s personal assistant, Watson [18] from IBM also make use of generic 
knowledge graphs. Even in the field of medicines, KG called as Medical KG [19] is gaining lot of attention 
recently. 

Graph neural network (GNN) is also one of the techniques that are used in graph applications, the 
input method in this also works the same as that of the neural network system, but the nodes are converted 
into graph embeddings which is an important step involved in any GNN model. The main reason behind 
using GNN would be to be able to predict relationships(edges) between completely unrelated nodes, 
technically known as Edge prediction [20], [21]. Zhao et al. [22], discusses KG’s construction and 
architecture. It also gives an outline of KG's construction and looks at the approaches and difficulties that 
were encountered during the process. Building KG requires Semantic relatedness which plays a key role in 
linking a word with an existing word in our knowledge graph. It is a quantitative measure of how two words 
or concepts are related under a context, without any discrepancy of the syntactical form of the word. 
Conceptually speaking, semantic relatedness is based on functional relations like hyponymic, hypernymic, 
metronymic [23]. At the point when restricted to hyponymy/hypernymy relations, the action evaluates 
semantic similarities between two words or concepts. 

This paper discusses a method to build a robust knowledge graph and query the same to build a 
database by suggesting tables and columns in an automated way. The proposed system is designed to be 
based on a knowledge graph, where it creates a smaller graph for each of the nuclear schemas and then 
merges this graph with the global KG intelligently based on the similarity between the nodes by comparing 
their word vectors. So, the proposed algorithm not only suggests database schemas but also serves as an 
efficient tool to store information while minimizing redundancy. 


2. METHOD 
This section describes the approach that was applied to produce a knowledge graph of different 
entities. The workflow of the pipeline is shown in Figure 1. In short, the framework includes the following 
steps: 
— Schema to knowledge graph: convert each of the schemas to the knowledge graph 
— Executing the Algorithm: which exploits the link prediction using multiple natural language processing 
and machine learning tools. 
— Storing the data: in which the final knowledge graph is stored in the TinkerPop database. 
— Interaction with the System: in which multiple questions are asked to the user to get complete insights. 
— Query refining and output: in which keywords are extracted using natural language processing tools and 
the final schema is given to the user. 


Remote Knowledge 
Graph 


Figure 1. Shows the flowchart of the AI-based models and experimental methods applied 
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2.1. Schema to knowledge graph 

Given the multiple schemas, the first step is to convert the given schema into a knowledge graph. 
Firstly, each table is converted to a node [24], table attributes are added as the properties of node and relation 
as an edge for the graph and a complete knowledge graph is built. Table attributes are converted to properties 
of nodes so that it is easier to keep the properties of the original schema intact. 


2.2. Executing the algorithm 

This is the core step of the complete architecture. For the algorithm as shown in Table | to work at 
least, two knowledge graphs are required. For each node n/ in graph G1, iterate over each node n2 of graph 
G2. Using the synset similarity [25] function (from NLTK library) the percentage of similarity is calculated 
between the node n/ and 72, if the similarity is 100% (this mean both the nodes are the same) both the nodes 
are merged to form a single node otherwise if the similarity percentage is more than the threshold value, a 
new node n is created i.e., the hypernym of both nodes. Once the new node is created it is then connected to 
the node n/ and n2 by adding new edges from node n/ to n and from n2 to n. 


Table 1. Algorithm to merge two schemas based on similarity 


Algorithm 1: Join Graph 


Data: G1: graph1, G2: graph2 
Result: single merged graph 
for each node nle nodes(G1) do 
for each node n2«€ nodes(G2) do 
if similarity(n1,n2) == 1 then 
combine both nodes to one; 
else 
if similarity (n1,n2) = thresholdvalue then 
create-new-node(n); 
add-edge(n1,n); 
add-edge(n2,n); 
end 
end 
end 
end 


2.3. Storing the data 

Once the final knowledge graph is generated, it is then stored in a graph database i.e., Tinker Pop. 
After storing the knowledge graph in the database, the server is started and now the knowledge graph is ready 
to query. As of now, Gremlin query language [26] has been used to query the graph. 


2.4. Interaction with the system 

Users interact with the system through a chatbot and a sequence of questions is asked to get a 
complete insight into what a user wants. Based on the insights the system will provide the possible nodes 
(tables) to choose from. Once the nodes (table) are selected, the final schema is given to the user with all the 
relevant attributes and relations. 


2.5. Query refining and output 

Once the queries are received from the user, it is then processed to remove the unwanted words like 
articles, propositions, quantifiers. After cleaning the query, the final keyword is extracted, now this keyword 
will be used to extract the final sub-knowledge graph. After extracting the sub-knowledge graph, it is then 
converted into the schema so that it can be easily used by the user. 

Using an example, the complete proposed system is illustrated in the following steps. Two schemas 
are considered i.e., School and student schema, 
Step1: Converting schema to knowledge graph 

The first step converts the schema into a knowledge graph as shown in Figures 2-3, where node 
name is the table name and the attributes are stored in the properties of a node. Edges of the graphs contain 
the properties which also denotes the relationships between the tables. Now these knowledge graphs will be 
fed to the algorithm to generate the final knowledge graph. 
Step2: Executing the algorithm on knowledge graphs 

In this step, previously generated knowledge graphs are fed to the algorithm that predicts the new 
edges and nodes based on the similarity between two nodes (table) and if two nodes are similar then their 
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properties (attributes) are merged to get a single node. The final knowledge is generated as shown in 
Figure 4. The final knowledge graph contains much more context than the individual knowledge graphs. 


Staltype 


Figure 2. School schema to a knowledge graph 
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Figure 4. Merging student and school schema and result is a combined knowledge graph 
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Step3: Query and preprocessing 


In this step, the user-query is processed using the natural language processing modules that split the 
sentences into verbs, nouns and determiners. The irrelevant words (like: articles, preposition) are removed, to 
get only the essential words. Out of these useful words the keyword is extracted out as shown in Figure 5. 


Query : Give the schema for grade 
Processed : [('Give', 'VB'), (‘the', 'DT'), ("schema', ‘NN'), ('for', ‘IN'), ('grade', 'NN')] 


keyword : ('‘grade', 'NN') 


Figure 5. Grade query processing 


Step4: Output schema 

This is the final step of the process, where the keywords extracted in step3 are used to extract the 
subgraph over the final knowledge graph that is generated in step2. Depending on the level of details a user 
wants, the knowledge graph is generated as shown in Figure 6. Based on the output graph, the schema is 
generated with all the attributes. 
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Figure 6. Output grade schema with details up to level 2 when queried over the merged graph 


3. RESULTS AND DISCUSSION 

This section briefly introduces some illustrative queries and results representing typical schema. The 
dataset that was used for experimenting has more than 10 schemas from different domains (like: school, 
employee, sales, registration, and library). The graph in Figure 7 shows the final graph obtained by merging 
all the nuclear graphs according to the algorithm. As we keep on adding the more schemas this graph keeps 
on growing. 

Input Query: 

For the query “Give the schema for students” the query was first preprocessed and the keyword was 
extracted i.e. “Student” shown in Figure 8. After the keyword is extracted, it is then used to extract out the 
subgraph from the main knowledge graph as shown in Figure 7. From the output schema shown in Figure 9 it 
can be seen that the student node has more than 10 directly connected nodes, this shows how much-enriched 
details are provided by the schema. 
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Figure 7. Knowledge graph after merging 11 schemas 


Query : Give the schema for students 
Processed : [('Give', 'VB'), ('the', 'DT'), (‘'schema', 'NN'), ('for', ‘IN'), (‘students', 'NNS')] 


keyword : ('students', 'NNS') 


Figure 8. Query for student schema 
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Figure 9. Student schema when queried over the merged graph 
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For the query “Give the schema for payments” as shown in Figure 10, the query is processed to 
remove the unwanted words, and the keyword extracted is “payments”. Using this keyword subgraph is 
extracted from the merged graph as shown in Figure 7. The subgraph generated has details up to depth 2 as 
shown in Figure 11. 


Query : Give the schema for payments 
Processed : [('Give', 'VB'), ('the', 'DT'), ('schema', 'NN'), ('for', 'IN'), ('payments', 'NNS')] 


keyword : ('payments', 'NNS') 


Figure 10. Query for payments 
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Figure 11. Payments schema when queried over the merged graph 


4. CONCLUSION 

Technological advancements in software industries have led to an increase in the usage of data 
storage systems, especially, relational database systems are being extensively used. Designing a database 
schema takes a considerable share of total development time in any project. Hence, there is a need for 
automating the entire process of designing a database schema. In this paper, various approaches were 
discussed to solve the problem and it was evident that the knowledge graph-based approach suits best for the 
purpose. Here, usage of KG was investigated to dynamically generate schemas that satisfy all the constraints 
discussed in the paper and also reduces the redundancy of the data in the knowledge graph. The existing KG 
built from various schemas is queried to suggest schemas to the user. The proposed model was tested with 
multiple schemas to generate the global knowledge graph and it was queried and verified manually. This 
method helps the user in developing relationships between given schemas to draw the KG and query the 
same to suggest a schema. 
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